12 research outputs found

    Thematically Reinforced Explicit Semantic Analysis

    We present an extended, thematically reinforced version of Gabrilovich and Markovitch's Explicit Semantic Analysis (ESA), where we obtain thematic information through the category structure of Wikipedia. For this we first define a notion of categorical tfidf which measures the relevance of terms in categories. Using this measure as a weight we calculate a maximal spanning tree of the Wikipedia corpus considered as a directed graph of pages and categories. This tree provides us with a unique path of "most related categories" between each page and the top of the hierarchy. We reinforce tfidf of words in a page by aggregating it with categorical tfidfs of the nodes of these paths, and define a thematically reinforced ESA semantic relatedness measure which is more robust than standard ESA and less sensitive to noise caused by out-of-context words. We apply our method to the French Wikipedia corpus, evaluate it through a text classification on a 37.5 MB corpus of 20 French newsgroups and obtain a precision increase of 9-10% compared with standard ESA. Comment: 13 pages, 2 figures, presented at CICLing 201
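As a rough illustration of the baseline that this abstract extends, the sketch below shows standard ESA only: a text is mapped to a weighted vector of Wikipedia concepts, and relatedness is the cosine of two such vectors. It does not implement the categorical tfidf or the spanning-tree reinforcement, which the abstract does not specify in enough detail. The names `esa_vector`, `cosine`, and `TOY_INDEX`, and all weights, are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical inverted index: word -> {Wikipedia concept: tf-idf weight}.
# The concepts and weights are invented for illustration only.
TOY_INDEX = {
    "cat": {"Felidae": 0.8, "Pet": 0.5},
    "dog": {"Canidae": 0.8, "Pet": 0.6},
    "stock": {"Finance": 0.9},
}

def esa_vector(text, inverted_index):
    """Sum the concept weights of every known word in the text."""
    vec = defaultdict(float)
    for word in text.lower().split():
        for concept, weight in inverted_index.get(word, {}).items():
            vec[concept] += weight
    return dict(vec)

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # no known words in at least one text
    return dot / (norm_u * norm_v)

def relatedness(a, b, inverted_index=TOY_INDEX):
    return cosine(esa_vector(a, inverted_index), esa_vector(b, inverted_index))
```

With this toy index, "cat" and "dog" share the concept "Pet" and so score above zero, while "cat" and "stock" share no concept and score zero.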

    A Semantic Relatedness Measure Based on Combined Encyclopedic, Ontological and Collocational Knowledge

    We describe a new semantic relatedness measure combining the Wikipedia-based Explicit Semantic Analysis measure, the WordNet path measure and the mixed collocation index. Our measure achieves the currently highest results on the WS-353 test: a Spearman rho coefficient of 0.79 (vs. 0.75 in (Gabrilovich and Markovitch, 2007)) when applying the measure directly, and a value of 0.87 (vs. 0.78 in (Agirre et al., 2009)) when using the prediction of a polynomial SVM classifier trained on our measure. In the appendix we discuss the adaptation of ESA to 2011 Wikipedia data, as well as various unsuccessful attempts to enhance ESA by filtering at word, sentence, and section level. Comment: 6 pages, 6 figures, accepted for publication at IJCNLP2011 Conference
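The Spearman rho figures quoted above compare the ranking of word pairs produced by a measure with the ranking given by human judges. A minimal sketch of that computation follows, assuming tied values share their average rank and rho is the Pearson correlation of the two rank lists. The function names are hypothetical, and no WS-353 data is reproduced here.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("rho is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)
```

In an evaluation of this kind, `xs` would hold the scores a measure assigns to each word pair and `ys` the corresponding human ratings.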

    OASIS at NTCIR-6: on-line query translation for Chinese-Japanese cross-lingual information retrieval

    This paper reports results of Chinese-Japanese CLIR experiments using on-line query translation techniques. Approaches that employ English as a pilot language and that utilize several on-line translation systems are introduced. They were tested on the NTCIR-3, 4, 5, and 6 collections. The proposed procedures can be helpful under certain circumstances.

    Tool to Retrieve Less-Filtered Information from the Internet

    While users benefit greatly from the latest communication technology, with popular platforms such as social networking services including Facebook or search engines such as Google, scientists warn of the effects of a filter bubble at this time. A solution to escape from filtered information is urgently needed. We implement an approach based on the mechanism of a metasearch engine to present less-filtered information to users. We develop a practical application named MosaicSearch to select search results from diversified categories of sources collected from multiple search engines. To determine the power of MosaicSearch, we conduct an evaluation to assess retrieval quality. According to the results, MosaicSearch is more intelligent than other general-purpose search engines: it generates a smaller number of links while providing users with almost the same amount of objective information. Our approach contributes to transparent information retrieval. This application helps users play a main role in choosing the information they consume.

    Query Translation for CLIR: EWC vs. Google Translate

    A new approach to finding accurate translations of search engine queries from Japanese into English for the CLIR task is proposed. The Mecab system and online dictionary SPACEALC are utilized to segment Japanese queries and to get all possible English senses for every term detected. To disambiguate terms, the idea of the shortest path on an oriented graph is applied. Nodes of this graph symbolize word senses and edges connect nodes representing neighboring Japanese terms. The EWC semantic relatedness measure is used to select the most related meanings for the translation results. This measure combines the Wikipedia-based Explicit Semantic Analysis measure, the WordNet path measure and the mixed collocation index. The proposed technique is tested on the NTCIR data collection. Queries generated by Google Translate were used to evaluate the quality of translation.
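The shortest-path disambiguation described above can be sketched as follows. Each query term contributes a layer of candidate English senses, and edges run only between senses of neighboring terms. This sketch assumes an edge costs 1 minus the relatedness of its two senses, which the abstract does not state, and it uses a toy lookup table in place of the EWC measure. The names `disambiguate`, `TOY_SCORES`, and `toy_relatedness` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

def disambiguate(candidates: List[List[str]],
                 relatedness: Callable[[str, str], float]) -> List[str]:
    """Choose one sense per term via the cheapest path through the layers."""
    if not candidates or any(not layer for layer in candidates):
        raise ValueError("every term needs at least one candidate sense")
    # best[s] = (cost of the cheapest path ending at sense s, that path)
    best: Dict[str, Tuple[float, List[str]]] = {
        s: (0.0, [s]) for s in candidates[0]
    }
    for layer in candidates[1:]:
        nxt: Dict[str, Tuple[float, List[str]]] = {}
        for sense in layer:
            # Assumed edge cost: 1 - relatedness(previous sense, this sense).
            cost, path = min(
                ((c + 1.0 - relatedness(prev, sense), p)
                 for prev, (c, p) in best.items()),
                key=lambda item: item[0],
            )
            nxt[sense] = (cost, path + [sense])
        best = nxt
    return min(best.values(), key=lambda item: item[0])[1]

# Toy relatedness scores, invented for illustration (not EWC output).
TOY_SCORES = {
    ("bank", "interest"): 0.9, ("shore", "interest"): 0.1,
    ("bank", "hobby"): 0.1, ("shore", "hobby"): 0.2,
    ("interest", "rate"): 0.9, ("interest", "speed"): 0.1,
    ("hobby", "rate"): 0.1, ("hobby", "speed"): 0.2,
}

def toy_relatedness(a: str, b: str) -> float:
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.0))

senses = [["bank", "shore"], ["interest", "hobby"], ["rate", "speed"]]
chosen = disambiguate(senses, toy_relatedness)
```

Because edges only connect adjacent layers, one left-to-right pass finds the shortest path without a general-purpose graph search.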

    Accurate Query Translation for Japanese-English Cross-Language Information Retrieval

    In this paper, a novel approach to translate queries from Japanese into English for the CLIR task is discussed. To get all possible English senses for every Japanese term, the online dictionary SPACEALC is utilized. The EWC semantic relatedness measure is used to select the most related meanings for the results of translation. This measure combines the Wikipedia-based Explicit Semantic Analysis measure, the WordNet path measure and the mixed collocation index. Preliminary tests of the proposed technique are carried out on the NTCIR data collection. Retrieval performance is compared with that of retrieval using queries generated by Google Translate.

    A Query Expansion Technique Using the EWC Semantic Relatedness Measure

    This paper analyses the efficiency of the EWC semantic relatedness measure in an ad-hoc retrieval task. This measure combines the Wikipedia-based Explicit Semantic Analysis (ESA) measure, the WordNet path measure and the mixed collocation index. EWC considers encyclopaedic, ontological, and collocational knowledge about terms. This advantage of EWC is a key factor in finding precise terms for automatic query expansion. In the experiments, the open source search engine Terrier is utilised as a tool to index and retrieve data. The proposed technique is tested on the NTCIR data collection. The experiments demonstrated the superiority of EWC over ESA.